* Unlike a TM, an NTM is completely differentiable
in human evolution (Pinker vs Chomsky)
— Long Short-Term Memory (LSTM) RNNs are designed to
handle vanishing and exploding gradients

— Natively handle variable-length structures
[Figure: NTM architecture diagram with External Input, External Output, Controller, Read Heads, and Write Heads]
• 3. Addressing

— 1. Focusing by Content
* Each head produces a key vector $k_t$ of length M
* Generates a content-based weighting $w_t^c$ from a
similarity measure, scaled by the 'key strength' $\beta_t$
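
A minimal sketch of focusing by content, assuming cosine similarity as the similarity measure and a softmax over the N memory rows. The function name `content_weighting`, the example memory, and the `eps` guard are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def content_weighting(memory, key, beta):
    """Content-based weighting over the N rows of an (N, M) memory.

    memory: (N, M) array, key: (M,) array, beta: positive key strength.
    """
    eps = 1e-8  # assumed guard against division by zero for all-zero rows
    dots = memory @ key
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps
    similarity = dots / norms          # cosine similarity per memory row
    scores = beta * similarity         # key strength scales the similarities
    scores -= scores.max()             # stabilise the softmax numerically
    weights = np.exp(scores)
    return weights / weights.sum()     # normalise to a distribution

# Hypothetical 4 x 3 memory; the key matches row 1 exactly.
memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
key = np.array([0.0, 1.0, 0.0])
w_soft = content_weighting(memory, key, beta=1.0)
w_sharp = content_weighting(memory, key, beta=50.0)
```

A larger key strength concentrates the weighting on the best-matching row, while a small one spreads it across similar rows.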


— 2. Interpolation 
* Each head emits a scalar interpolation gate $g_t$ in the range (0, 1)

$$w_t^g \leftarrow g_t w_t^c + (1 - g_t) w_{t-1}$$
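
The interpolation step is a convex blend of the content weighting and the previous weighting. A small sketch, with the function name `interpolate` and the example vectors chosen purely for illustration:

```python
import numpy as np

def interpolate(w_content, w_prev, g):
    """Blend the content weighting with the previous weighting using gate g."""
    return g * w_content + (1.0 - g) * w_prev

w_content = np.array([0.0, 1.0, 0.0, 0.0])  # hypothetical content weighting
w_prev = np.array([1.0, 0.0, 0.0, 0.0])     # hypothetical previous weighting
w_gated = interpolate(w_content, w_prev, g=0.25)
```

With g = 1 the previous weighting is ignored entirely; with g = 0 the content weighting is ignored.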


— 3. Convolutional shift 


* Each head emits a distribution $s_t$ over the allowable integer
shifts
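
One way to picture the convolutional shift is as a probability-weighted sum of rotated copies of the weighting. The sketch below assumes the allowable shifts are -1, 0 and +1 and that shifts wrap around the memory; the function name `circular_shift` is hypothetical.

```python
import numpy as np

def circular_shift(w, shift_dist, shifts=(-1, 0, 1)):
    """Circularly convolve weighting w with a distribution over integer shifts."""
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    for prob, shift in zip(shift_dist, shifts):
        # np.roll(w, s) moves the mass at index i to index (i + s) mod N
        out += prob * np.roll(w, shift)
    return out

w = np.array([0.0, 1.0, 0.0, 0.0])
w_moved = circular_shift(w, [0.0, 0.0, 1.0])    # all mass on shift +1
w_blurred = circular_shift(w, [0.1, 0.8, 0.1])  # soft shift spreads the focus
```

A shift distribution that is not one-hot blurs the weighting, which is what the sharpening step is meant to counteract.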
— 4. Sharpening 


* Each head emits a scalar sharpening parameter $\gamma_t \ge 1$
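
Sharpening can be sketched as raising each weight to the power of the sharpening parameter and renormalising. The function name `sharpen` and the example weighting are assumptions for illustration.

```python
import numpy as np

def sharpen(w, gamma):
    """Raise each weight to the power gamma (>= 1) and renormalise."""
    powered = np.asarray(w, dtype=float) ** gamma
    return powered / powered.sum()

w_blurred = np.array([0.1, 0.8, 0.1, 0.0])  # hypothetical blurred weighting
w_same = sharpen(w_blurred, gamma=1.0)      # gamma = 1 leaves it unchanged
w_peaked = sharpen(w_blurred, gamma=2.0)    # gamma > 1 concentrates the focus
```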
• 3. Addressing (putting it all together)

— This can operate in three complementary modes
* A weighting can be chosen by the content system
without any modification by the location system
* A weighting produced by the content addressing
system can be chosen and then shifted
* A weighting from the previous time step can be rotated
without any input from the content-based addressing
system
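
The three modes can be illustrated by chaining the four addressing steps in one function. This is a sketch under stated assumptions: cosine similarity for content matching, allowable shifts of -1, 0 and +1, and a hypothetical function name `address` with illustrative parameter values.

```python
import numpy as np

def address(memory, w_prev, key, beta, g, shift_dist, gamma):
    """Content focus -> interpolation -> circular shift -> sharpening."""
    # 1. Content focus: softmax over cosine similarities scaled by beta
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sim = memory @ key / norms
    w_c = np.exp(beta * (sim - sim.max()))
    w_c /= w_c.sum()
    # 2. Interpolation with the previous weighting
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3. Circular shift over the assumed shifts (-1, 0, +1)
    w_s = sum(p * np.roll(w_g, s) for p, s in zip(shift_dist, (-1, 0, 1)))
    # 4. Sharpening
    w = w_s ** gamma
    return w / w.sum()

memory = np.eye(4)                        # hypothetical 4 x 4 memory
key = memory[1]                           # key matching row 1
w_prev = np.array([0.0, 0.0, 1.0, 0.0])   # previous focus on row 2

# Mode 1: content only (gate open, no shift)
mode1 = address(memory, w_prev, key, 50.0, 1.0, [0, 1, 0], 1.0)
# Mode 2: content, then shifted by +1
mode2 = address(memory, w_prev, key, 50.0, 1.0, [0, 0, 1], 1.0)
# Mode 3: previous weighting rotated by +1, content ignored (gate closed)
mode3 = address(memory, w_prev, key, 50.0, 0.0, [0, 0, 1], 1.0)
```

In this example the focus lands on row 1, row 2 and row 3 respectively, so the gate and the shift distribution alone select between the modes.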
• Controller Network Architecture
— Feed Forward vs Recurrent

— The LSTM version of an RNN has its own internal
memory, complementary to M

— Hidden LSTM layers are 'like' registers in a
processor

— Allows information to be mixed across multiple
time-steps

— Feed Forward has better transparency
• Test NTM's ability to learn simple algorithms
like copying and sorting

• Demonstrate that solutions generalize well
beyond the range of training

• Tests three architectures
— NTM with feed forward controller
— NTM with LSTM controller
— Standard LSTM network
• 1. Copy

— Tests whether NTM can store and retrieve data
— Trained to copy sequences of 8-bit vectors

— Sequences vary between 1 and 20 vectors
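
A copy-task training example could be generated roughly as follows. The layout is an assumption: the sequence is presented first, then a delimiter flag on an extra input channel, then blank steps during which the target is the original sequence. The function name `make_copy_example` is hypothetical.

```python
import numpy as np

def make_copy_example(rng, min_len=1, max_len=20, width=8):
    """Return (inputs, targets) for one copy episode."""
    length = int(rng.integers(min_len, max_len + 1))
    seq = rng.integers(0, 2, size=(length, width)).astype(float)
    # Assumed layout: `length` input steps, 1 delimiter step, `length` output steps
    inputs = np.zeros((2 * length + 1, width + 1))
    inputs[:length, :width] = seq
    inputs[length, width] = 1.0          # delimiter flag on the extra channel
    targets = np.zeros((2 * length + 1, width))
    targets[length + 1:] = seq           # reproduce the sequence after the delimiter
    return inputs, targets

rng = np.random.default_rng(0)
inputs, targets = make_copy_example(rng)
```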
Experiments


• 2. Repeat Copy

— Tests whether NTM can learn a simple nested
function

— Extends copy by repeatedly copying the input a specified
number of times

— Training input is a random-length sequence of 8-bit
binary vectors plus a scalar value for the number of copies

— Scalar value is random between 1 and 10
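
A repeat-copy example might be built as below. The maximum sequence length of 10 and the choice to carry the repeat count on a separate input channel are assumptions; the slides state only that the length is random and the count lies between 1 and 10. The function name `make_repeat_copy_example` is hypothetical.

```python
import numpy as np

def make_repeat_copy_example(rng, max_len=10, max_repeats=10, width=8):
    """Return (inputs, targets) for one repeat-copy episode."""
    length = int(rng.integers(1, max_len + 1))       # assumed length range
    repeats = int(rng.integers(1, max_repeats + 1))  # 1 to 10 copies
    seq = rng.integers(0, 2, size=(length, width)).astype(float)
    inputs = np.zeros((length + 1, width + 1))
    inputs[:length, :width] = seq
    inputs[length, width] = repeats      # scalar repeat count on its own channel
    targets = np.tile(seq, (repeats, 1)) # the sequence repeated `repeats` times
    return inputs, targets

rng = np.random.default_rng(1)
rc_inputs, rc_targets = make_repeat_copy_example(rng)
```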
Figure 7: Repeat Copy Learning Curves.
• 3. Associative Recall
— Tests NTM's ability to associate data references

— Training input is a list of items, followed by a query
item

— Output is the subsequent item in the list
— Each item is a sequence of three 6-bit binary vectors

— Each 'episode' has between two and six items
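
The structure of an associative recall episode can be sketched as follows. Delimiters between items are omitted for brevity, and the function name `make_recall_example` and its return layout are assumptions.

```python
import numpy as np

def make_recall_example(rng, min_items=2, max_items=6, item_len=3, width=6):
    """Return (items, query_index, query, target) for one episode."""
    n_items = int(rng.integers(min_items, max_items + 1))
    # Each item is a sequence of three 6-bit binary vectors
    items = rng.integers(0, 2, size=(n_items, item_len, width))
    # Query any item except the last, so that a subsequent item exists
    query_idx = int(rng.integers(0, n_items - 1))
    query = items[query_idx]
    target = items[query_idx + 1]
    return items, query_idx, query, target

rng = np.random.default_rng(2)
items, query_idx, query, target = make_recall_example(rng)
```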
Figure 10: Associative Recall Learning Curves for NTM and LSTM.

Figure 11: Generalisation Performance on Associative Recall for Longer Item Sequences.
The NTM with either a feedforward or LSTM controller generalises to much longer sequences 
of items than the LSTM alone. In particular, the NTM with a feedforward controller is nearly 
perfect for item sequences of twice the length of sequences in its training set. 


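• 4. Dynamic N-Grams

— Tests whether NTM can rapidly adapt to new
predictive distributions

— Trained on 6-gram binary patterns over 200-bit
sequences

— Can NTM learn the optimal estimator?

$$P(B = 1 \mid N_1, N_0, c) = \frac{N_1 + \frac{1}{2}}{N_1 + N_0 + 1}$$

where $c$ is the five-bit previous context and $N_0$, $N_1$ are the numbers of zeros and ones observed so far after that context.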
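
The optimal estimator can be computed by keeping a pair of counts per context. A minimal sketch, assuming a five-bit context for the 6-gram case; the function name `optimal_predictions` is hypothetical.

```python
from collections import defaultdict

def optimal_predictions(bits, context_len=5):
    """Predict P(next bit = 1) at each position from per-context counts."""
    counts = defaultdict(lambda: [0, 0])   # context -> [N0, N1]
    preds = []
    for i in range(context_len, len(bits)):
        context = tuple(bits[i - context_len:i])
        n0, n1 = counts[context]
        preds.append((n1 + 0.5) / (n0 + n1 + 1.0))  # estimator from the slide
        counts[context][bits[i]] += 1      # update counts after predicting
    return preds

# Eight ones in a row: the context (1, 1, 1, 1, 1) is seen three times.
preds = optimal_predictions([1, 1, 1, 1, 1, 1, 1, 1])
# preds is [0.5, 0.75, 0.8333...]
```

With no observations the estimate is 0.5, and it moves towards the empirical frequency as counts accumulate.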
Figure 13: Dynamic N-Gram Learning Curves.
• 5. Priority Sort
— Tests whether NTM can sort data

— Input is a sequence of 20 random binary vectors,
each with a scalar priority rating drawn from [-1, 1]

— Target sequence is the 16 highest-priority vectors
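
A priority-sort example could be generated along these lines. The vector width of 8, the uniform draw of priorities, and the descending order of the target are assumptions not stated on the slide; the function name `make_priority_sort_example` is hypothetical.

```python
import numpy as np

def make_priority_sort_example(rng, n_in=20, n_out=16, width=8):
    """Return (inputs, targets, target_priorities) for one episode."""
    vectors = rng.integers(0, 2, size=(n_in, width)).astype(float)
    priorities = rng.uniform(-1.0, 1.0, size=n_in)
    # Indices of the n_out highest priorities, highest first (assumed order)
    order = np.argsort(-priorities)[:n_out]
    # Priority travels alongside each vector on an extra input channel
    inputs = np.concatenate([vectors, priorities[:, None]], axis=1)
    targets = vectors[order]
    return inputs, targets, priorities[order]

rng = np.random.default_rng(3)
ps_inputs, ps_targets, ps_priorities = make_priority_sort_example(rng)
```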


• 6. Details
— RMSProp algorithm
— Momentum 0.9
— All LSTMs had three stacked hidden layers
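
For orientation, one common formulation of RMSProp with momentum looks like the sketch below. Only the momentum of 0.9 comes from the slide; the decay rate, the `eps` term, the demo learning rate and the function name `rmsprop_momentum_step` are assumptions, and the exact variant used in the experiments may differ.

```python
import numpy as np

def rmsprop_momentum_step(w, grad, state, lr=1e-4, decay=0.95,
                          momentum=0.9, eps=1e-4):
    """Apply one RMSProp-with-momentum update and return the new weights."""
    # Running average of squared gradients (decay is an assumed value)
    state["sq"] = decay * state["sq"] + (1.0 - decay) * grad ** 2
    # Velocity accumulates the gradient scaled by its running magnitude
    state["vel"] = momentum * state["vel"] - lr * grad / np.sqrt(state["sq"] + eps)
    return w + state["vel"]

# Toy demo: a few steps on f(w) = w^2, whose gradient is 2w.
w = 1.0
state = {"sq": 0.0, "vel": 0.0}
for _ in range(5):
    w = rmsprop_momentum_step(w, 2.0 * w, state, lr=0.01)
```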
Task            #Heads   Controller Size   Memory Size   Learning Rate   Parameters
Copy            1        100               128 x 20      10^-4           17,162
Repeat Copy     1        100               128 x 20      10^-4           16,712
Associative     4        256               128 x 20      10^-4           146,845
N-Grams         1        100               128 x 20      3 x 10^-5       14,656
Priority Sort   8        512               128 x 20      3 x 10^-5       508,305

Table 1: NTM with Feedforward Controller Experimental Settings
Task            #Heads   Controller Size   Memory Size   Learning Rate   Parameters
Copy            1        100               128 x 20      10^-4           67,561
Repeat Copy     1        100               128 x 20      10^-4           66,111
Associative     1        100               128 x 20      10^-4           70,330
N-Grams         1        100               128 x 20      3 x 10^-5       61,749
Priority Sort   5        2 x 100           128 x 20      3 x 10^-5       269,038

Table 2: NTM with LSTM Controller Experimental Settings


Task            Network Size   Learning Rate   Parameters
Copy            3 x 256        3 x 10^-5       1,352,969
Repeat Copy     3 x 512        3 x 10^-5       5,312,007
Associative     3 x 256        10^-4           1,344,518
N-Grams         3 x 128        10^-4           331,905
Priority Sort   3 x 128        3 x 10^-5       384,424

Table 3: LSTM Network Experimental Settings


Conclusion 


• Introduced a neural network architecture with
external memory that is differentiable end-to-end

• Experiments demonstrate that NTMs are
capable of learning simple algorithms and of
generalizing beyond the training regime


“Again, it [the Analytical Engine] might act upon other things besides
numbers... the engine might compose elaborate and scientific pieces of
music of any degree of complexity or extent.” — Ada Lovelace


